12  Functions + iteration

12.1 Learning goals:

At the end of this section, you should be able to:

  • Identify the core components of a function definition and explain their role (the function() directive, arguments, argument defaults, function body, return value)
  • Describe the difference between argument matching by position and by name
  • Write if-else, if-else if-else statements to conditionally execute code
  • Write your own function to carry out a repeated task
  • Replicate your function multiple times using map()

12.2 Functions

12.2.1 Why functions?

Getting really good at writing useful and reusable functions is one of the best ways to increase your expertise in data science. It requires a lot of practice.

If you’ve copied and pasted code 3 or more times, it’s time to write a function. Try to avoid repeating yourself. Functions help in three ways:

  1. Reducing errors: Copy + paste + modify is prone to errors (e.g., forgetting to change a variable name)
  2. Efficiency: If you need to update code, you only need to do it in one place. This also allows reuse of code within and across projects.
  3. Readability: Encapsulating code within a function with a descriptive name makes code more readable.

12.2.2 An example

Consider the following code (taken from https://r4ds.hadley.nz/iteration). What does it do?

library(tidyverse)  # provides tibble(), mutate(), and the other tidyverse functions used below

df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(a, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / 
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
# A tibble: 5 × 4
      a       b     c     d
  <dbl>   <dbl> <dbl> <dbl>
1 0.707 -0.204  0     0.758
2 0.252  0.796  0.229 0.115
3 0.190  0.235  0.655 0.772
4 1      0.0457 0.157 1    
5 0     -0.0266 1     0    

You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote the code (the example is from R4DS), he made an error when copying and pasting and forgot to change an a to a b, which is why column b in the output is not between 0 and 1. Preventing exactly this type of mistake is one very good reason to learn how to write functions.

The key to the work above is that we want to repeat a set of code multiple times. The code we want to replicate can be written as:

(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))

where █ represents the part of the code that changes each time the function is run.

12.2.3 Parts of a function

To create a function you need three things:

  1. A name. Here we’ll use rescale01() because this function rescales a vector to lie between 0 and 1.

  2. The arguments. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.

  3. The body. The body is the code that’s repeated across all the calls.

Then you create a function by following the template:

name <- function(arguments) {
  body
}

Which leads to the function:

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:

rescale01(c(-10, 0, 10))
[1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
[1] 0.00 0.25 0.50   NA 1.00
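
One possible refinement, sketched here under our own naming (rescale01_v2 is a hypothetical name, not part of the original example): the body computes the minimum twice, so the range can be computed once with range(). Testing with a constant vector also shows an edge case worth knowing about, since a zero range gives 0 / 0.

```r
# A sketch of the same rescaling that computes the range only once.
# rescale01_v2 is a hypothetical name used for illustration.
rescale01_v2 <- function(x) {
  rng <- range(x, na.rm = TRUE)   # rng[1] is the minimum, rng[2] the maximum
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01_v2(c(-10, 0, 10))
rescale01_v2(c(1, 2, 3, NA, 5))
rescale01_v2(c(4, 4, 4))   # zero range: 0 / 0 gives NaN
```

Whether NaN is the right answer for a constant vector depends on the analysis; the point is that a few simple test inputs reveal the behavior before the function is used on real data.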

Then you can rewrite the call to mutate() as:

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1 0.707 0     0     0.758
2 0.252 1     0.229 0.115
3 0.190 0.439 0.655 0.772
4 1     0.250 0.157 1    
5 0     0.178 1     0    
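
The call above still repeats rescale01() once per column. As a sketch of one way to remove that repetition too, across() from dplyr applies the same function to a selection of columns. The seed and the object name df_rescaled are our own choices for illustration.

```r
library(dplyr)

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- tibble(a = rnorm(5), b = rnorm(5), c = rnorm(5), d = rnorm(5))

# Apply rescale01() to columns a through d without naming each one
df_rescaled <- df |> mutate(across(a:d, rescale01))
df_rescaled
```

Every column of df_rescaled should now have a minimum of 0 and a maximum of 1.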

12.2.4 Ordering and arguments

When calling a function, if you don’t name the arguments, R assumes that you passed them in the order in which they appear in the function definition.

my_power <- function(x, y){
  return(x^y)
}

my_power(x = 2, y = 3)
[1] 8
my_power(y = 3, x = 2)
[1] 8
my_power(2, 3)
[1] 8
my_power(3, 2)
[1] 9

12.3 Argument matching

In general, it is safest to name your arguments and also keep them in the order of the function definition, for your own peace of mind. For functions that you are very familiar with (and whose argument order you know), it’s okay to rely on positional matching alone.
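
Named and positional matching can also be mixed in a single call. The short sketch below reuses my_power() from above: R matches the named arguments first, and any unnamed values then fill the remaining arguments in order.

```r
my_power <- function(x, y) {
  return(x^y)
}

# Named arguments are matched first; the unnamed value fills the remaining slot.
my_power(y = 3, 2)   # x = 2, y = 3, so 2^3
my_power(3, x = 2)   # x = 2, y = 3 again

# args() shows the argument order that positional matching relies on
args(my_power)
```

Both calls give 8, the same result as my_power(x = 2, y = 3).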

12.3.0.1 Function defaults

my_power <- function(x, y){
  return(x^y)
}

What will happen when I run the following code?

my_power(3)
Error in my_power(3): argument "y" is missing, with no default

Now give y a default value:

my_power <- function(x, y = 2){
  return(x^y)
}

What will happen when I run the following code?

my_power(3)
[1] 9

With the same definition, what will happen when I run the following code?

my_power(2, 3)
[1] 8

Finally, give both arguments a default value:

my_power <- function(x = 2, y = 3){
  return(x^y)
}

What will happen when I run the following code?

my_power()
[1] 8
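
To summarize the cases above in one place, here is a small sketch. The helper no_default() and the object msg are hypothetical names used only for illustration; formals() is a base R function that lists a function's arguments and their defaults.

```r
my_power <- function(x, y = 2) {
  return(x^y)
}

my_power(3)          # uses the default y = 2
my_power(3, y = 3)   # the default is overridden by name
formals(my_power)$y  # the stored default value

# Calling a function without a required argument raises an error;
# tryCatch() captures the message instead of stopping the script.
no_default <- function(x, y) x^y
msg <- tryCatch(no_default(3), error = function(e) conditionMessage(e))
msg
```

A default is only a fallback: any value supplied in the call, by name or by position, takes precedence.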

12.3.1 Returning a value

average1 <- function(x, remove_nas) {
    sum(x, na.rm = remove_nas)/sum(!is.na(x))
}

average2 <- function(x, remove_nas) {
    return(sum(x, na.rm = remove_nas)/sum(!is.na(x)))
}

average3 <- function(x, remove_nas = TRUE) {
    return(sum(x, na.rm = remove_nas)/sum(!is.na(x)))
    return(sum(x^2, na.rm = remove_nas)/sum(!is.na(x)))
}

some_data <- c(3, NA, 2, 13, 2, NA, 47)

average1(some_data)
Error in average1(some_data): argument "remove_nas" is missing, with no default
average1(some_data, remove_nas = TRUE)
[1] 13.4
average2(some_data)
Error in average2(some_data): argument "remove_nas" is missing, with no default
average2(some_data, remove_nas = TRUE)
[1] 13.4
average3(some_data)
[1] 13.4

  • without return(): the function returns the value of the last expression it evaluates. (If that last expression is an assignment using <-, the value is still returned, but invisibly, so nothing is printed.)

  • with return(): the function returns the object that is explicitly included in the return() call and stops immediately. (Note: if you (accidentally?) have two return() calls, as in average3(), the function returns the object in the first return() call that is reached; the second is never run.)
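
Both behaviors can be seen in a short sketch. The function names safe_average() and add_one_quietly() are hypothetical and exist only to illustrate the two bullet points.

```r
# Early exit: return() stops the function as soon as it is reached.
safe_average <- function(x) {
  if (all(is.na(x))) {
    return(NA_real_)   # nothing to average, so leave immediately
  }
  sum(x, na.rm = TRUE) / sum(!is.na(x))   # last expression is the return value
}

safe_average(c(3, NA, 2, 13, 2, NA, 47))
safe_average(c(NA, NA))

# If the last expression is an assignment, the value is returned invisibly.
add_one_quietly <- function(x) {
  y <- x + 1
}
add_one_quietly(1)            # prints nothing
result <- add_one_quietly(1)  # but the value is still returned
result
```

An explicit return() is most useful for early exits like the one in safe_average(); at the end of a function, the last expression alone is enough.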

12.4 Control flow

Often inside functions, you will want to execute code conditionally. In a programming language, control structures are parts of the language that allow you to control what code is executed. By far the most common is the if-else if-else structure.

if (logical_condition) {
    # some code
} else if (other_logical_condition) {
    # some other code
} else {
    # yet more code
}

  • Note that inside each pair of curly brackets, { }, you can have additional lines of code computing objects or conditions, or you can return desired objects.

  • You can include as many } else if { conditions as your problem calls for.
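
As a concrete instance of the template, here is a minimal sketch with one branch of each kind. The function describe_sign() is a hypothetical example and assumes x is a single, non-missing number.

```r
# Assumes x is a single non-missing number (an assumption of this sketch).
describe_sign <- function(x) {
  if (x > 0) {
    label <- "positive"
  } else if (x < 0) {
    label <- "negative"
  } else {
    label <- "zero"
  }
  label   # the last expression is returned
}

describe_sign(4)
describe_sign(-2.5)
describe_sign(0)
```

Only the first branch whose condition is TRUE is run; the else branch catches everything that is left.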

middle <- function(x) {
    mean_x <- mean(x, na.rm = TRUE)
    median_x <- median(x, na.rm = TRUE)
    seems_skewed <- (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x)
    if (seems_skewed) {
        median_x
    } else {
        mean_x
    }
}

Note that (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x) is a TRUE or FALSE question.

some_data <- c(3, NA, 2, 13, 2, NA, 47)

mean(some_data, na.rm = TRUE)
[1] 13.4
median(some_data, na.rm = TRUE)
[1] 3
middle(some_data)
[1] 3
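
One thing to watch for: if() needs a single TRUE or FALSE, and it stops with an error when the condition is NA. In middle(), that happens when every value of x is missing. The sketch below is one possible guard, under our own name middle_safe(); it wraps the condition in isTRUE(), so an NA condition is treated as "not skewed".

```r
# middle_safe() is a hypothetical variant of middle() for illustration.
middle_safe <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  median_x <- median(x, na.rm = TRUE)
  seems_skewed <- (mean_x > 1.5 * median_x) || (mean_x < (1 / 1.5) * median_x)
  if (isTRUE(seems_skewed)) {   # isTRUE(NA) is FALSE, so no error
    median_x
  } else {
    mean_x
  }
}

some_data <- c(3, NA, 2, 13, 2, NA, 47)
middle_safe(some_data)
middle_safe(c(NA_real_, NA_real_))   # mean of no values is NaN
```

Whether to return NaN, return NA, or stop with an informative error in the all-missing case is a design choice; the sketch only shows how to keep if() from failing unexpectedly.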

12.4.1 Functions in the tidyverse

Functions that return a vector the same length as their input (one value per row of the data frame) are good to use inside mutate() and filter(). For example, you might want to capitalize the first letter of every string:

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

first_upper("hello")
[1] "Hello"

Functions that collapse a vector into a single value work well in the summarize() step of a pipeline. For example, you may want to calculate the coefficient of variation, which is the standard deviation divided by the mean.

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(runif(100, min = 0, max = 50))
[1] 0.625
cv(runif(100, min = 0, max = 500))
[1] 0.624
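
To see a summary function like cv() inside an actual pipeline, here is a minimal sketch using the mtcars data that ships with R. The choice of data set, the grouping variable, and the names cv_by_cyl and mpg_cv are our own, for illustration only.

```r
library(dplyr)

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# One row per cylinder group, with one cv value for each group
cv_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mpg_cv = cv(mpg, na.rm = TRUE))
cv_by_cyl
```

Because cv() returns one number per call, summarize() produces exactly one row for each group.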

12.4.2 Functions summary

  • Functions can be used to avoid repeating code
  • Arguments allow us to specify the inputs when we call a function
  • If inputs are not named when calling the function, R uses the ordering from the function definition
  • All arguments without a default value must be specified when calling a function
  • Default arguments can be specified when the function is defined
  • The input to a function can be a function!
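
The last point deserves a quick illustration, since it is what makes map() possible later on. In the sketch below, summarize_with() is a hypothetical function whose second argument f is itself a function, with mean as the default.

```r
# f is a function passed in as an argument; mean is its default value.
summarize_with <- function(x, f = mean) {
  f(x, na.rm = TRUE)
}

some_data <- c(3, NA, 2, 13, 2, NA, 47)

summarize_with(some_data)            # uses the default, mean
summarize_with(some_data, median)    # function matched by position
summarize_with(some_data, f = max)   # function matched by name
```

Note that the function is passed by name without parentheses (median, not median()), because we are handing over the function itself rather than calling it.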

12.5 Iterating functions

There will be times when you need to apply a function repeatedly, for example to each element of a vector or to each column of a data frame.

12.5.1 purrr for functional programming

We will see the R package purrr in greater detail as we go, but for now, let’s get a hint for how it works.

We are going to focus on the map family of functions, which will get us started. There are lots of other good purrr functions, like pluck() and accumulate(), as well as across() from dplyr.

Much of what follows is taken from a tutorial by Rebecca Barter.

The map functions are named by the output they produce. For example:

  • map(.x, .f) is the main mapping function and returns a list

  • map_df(.x, .f) returns a data frame (superseded as of purrr 1.0.0 in favor of map() followed by list_rbind(), but still available)

  • map_dbl(.x, .f) returns a numeric (double) vector

  • map_chr(.x, .f) returns a character vector

  • map_lgl(.x, .f) returns a logical vector

Note that the first argument is always the data object and the second argument is always the function you want to iteratively apply to each element in the input object.

The input to a map function is always either a vector (like a column), a list (which can be non-rectangular), or a data frame (like a rectangle).
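
The naming pattern is easiest to see side by side. The following sketch applies small anonymous functions (written with the \(v) syntax available in R 4.1 and later) to the same input and stores each result; the object names are our own.

```r
library(purrr)

x <- c(2, 5, 10)

as_list <- map(x, \(v) v + 10)                    # list of three numbers
as_dbl  <- map_dbl(x, \(v) v + 10)                # numeric vector
as_chr  <- map_chr(x, \(v) paste0("value_", v))   # character vector
as_lgl  <- map_lgl(x, \(v) v > 4)                 # logical vector

as_dbl
as_chr
as_lgl
```

The typed variants are strict: each call to the function must return a single value of the matching type, otherwise purrr stops with an error.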

A list is a way to hold things which might be very different in shape:

a_list <- list(a_number = 5,
               a_vector = c("a", "b", "c"),
               a_dataframe = data.frame(a = 1:3, 
                                        b = c("q", "b", "z"), 
                                        c = c("bananas", "are", "so very great")))

print(a_list)
$a_number
[1] 5

$a_vector
[1] "a" "b" "c"

$a_dataframe
  a b             c
1 1 q       bananas
2 2 b           are
3 3 z so very great
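
Since map() will hand the elements of a list to a function one at a time, it helps to know how to pull out a single element yourself. This is a minimal base R sketch using a shortened version of a_list (the third column is dropped only to keep the example short).

```r
a_list <- list(a_number = 5,
               a_vector = c("a", "b", "c"),
               a_dataframe = data.frame(a = 1:3, b = c("q", "b", "z")))

a_list$a_number           # extract an element by name with $
a_list[["a_vector"]][2]   # [[ ]] pulls out the element itself, then [2] indexes it
a_list[2]                 # single [ ] keeps a list of length one
lengths(a_list)           # length of each element (a data frame counts its columns)
```

The distinction between [ ] and [[ ]] matters with map(): the function you supply receives each element as [[ ]] would return it, not wrapped in a list.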

Consider the following function:

add_ten <- function(x) {
  return(x + 10)
  }

We can map() the add_ten() function across a vector. Note that the output is a list (the default).

library(purrr)
map(.x = c(2, 5, 10),
    .f = add_ten)
[[1]]
[1] 12

[[2]]
[1] 15

[[3]]
[1] 20

What if we use a different type of input? The default behavior is to still return a list!

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten)
$a
[1] 12

$b
[1] 15

$c
[1] 20

What if we want a different type of output? If the function outputs a data frame, we can combine the values in the list using a row-bind, list_rbind().

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten) |> 
  list_rbind()
Error in `list_rbind()`:
! Each element of `x` must be either a data frame or `NULL`.
ℹ Elements 1, 2, and 3 are not.

Darn! We get an error because the output of the add_ten() function is a scalar, not a data frame. In order to use list_rbind() we need to edit the add_ten() function.

add_ten_df <- function(x) {
  return(data.frame(x + 10))
}

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten_df) |> 
  list_rbind()   # output bound by rows
   x
1 12
2 15
3 20

c(2, 5, 10) |>
  purrr::map(add_ten_df) |> 
  list_rbind()   # output bound by rows
   x
1 12
2 15
3 20

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten_df) |> 
  list_cbind()   # output bound by columns
  x...10 x...10 x...10
1     12     15     20

c(2, 5, 10) |>
  purrr::map(add_ten_df) |> 
  list_cbind()   # output bound by columns
  x...1 x...2 x...3
1    12    15    20

Shorthand lets us get away from pre-defining the function (which will be useful). Use the tilde ~ to indicate that you have a function:

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map_df(~{.x + 10})
# A tibble: 1 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1    12    15    20
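
The tilde is one of three ways to write a function on the fly. As a sketch, the three calls below should all do the same thing; the \(v) form is the base R anonymous-function shorthand available from R 4.1, and the object names are our own.

```r
library(purrr)

x <- c(2, 5, 10)

out_formula  <- map_dbl(x, ~ .x + 10)              # purrr formula shorthand
out_lambda   <- map_dbl(x, \(v) v + 10)            # base R shorthand (R >= 4.1)
out_function <- map_dbl(x, function(v) v + 10)     # full anonymous function

out_formula
```

In the formula version the argument is always called .x, while the other two forms let you choose a more descriptive argument name.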

Mostly, the tilde will be used for functions we already know but want to modify (if we don’t modify, and it has a simple name, we don’t use the tilde):

library(palmerpenguins)
library(broom)

penguins_split <- split(penguins, penguins$species)
penguins_split |>
  purrr::map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) |>
  purrr::map(tidy) |>   # map(tidy)
  list_rbind()
# A tibble: 6 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -2536.     965.       -2.63 9.48e- 3
2 flipper_length_mm     32.8      5.08      6.47 1.34e- 9
3 (Intercept)        -3037.     997.       -3.05 3.33e- 3
4 flipper_length_mm     34.6      5.09      6.79 3.75e- 9
5 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
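6 flipper_length_mm     54.6      5.03     10.9  1.33e-19

The same models can be fit with group_by() and group_map() from dplyr; here the results are left as a list of tibbles instead of being row-bound:

penguins |>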
  group_by(species) |>
  group_map(~lm(body_mass_g ~ flipper_length_mm, data = .x)) |>
  purrr::map(tidy)  # map_df(tidy)
[[1]]
# A tibble: 2 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        -2536.     965.       -2.63 0.00948      
2 flipper_length_mm     32.8      5.08      6.47 0.00000000134

[[2]]
# A tibble: 2 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        -3037.     997.       -3.05 0.00333      
2 flipper_length_mm     34.6      5.09      6.79 0.00000000375

[[3]]
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
2 flipper_length_mm     54.6      5.03     10.9  1.33e-19
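
Notice that neither output above says which species each pair of rows belongs to. One option, sketched here with the built-in mtcars data so that it runs without extra packages, is the names_to argument of list_rbind(), which stores the names of the list as a column. The models, the hand-built coefficient table, and the column name cyl are our own choices for illustration.

```r
library(purrr)

fits <- split(mtcars, mtcars$cyl) |>              # named list: "4", "6", "8"
  map(\(d) lm(mpg ~ wt, data = d)) |>             # one model per group
  map(\(m) data.frame(term = names(coef(m)),      # small coefficient table
                      estimate = unname(coef(m)))) |>
  list_rbind(names_to = "cyl")                    # keep the group label
fits
```

This works because split() returns a named list; the same argument should carry over to the penguins example, where the names are the species.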

12.6 Reflection questions

12.7 Ethics considerations